Support VLM calibration with image-text data#755
Conversation
Codecov Report

❌ Patch coverage is

```
@@            Coverage Diff             @@
##             main     #755      +/-   ##
==========================================
- Coverage   74.13%   73.08%   -1.05%
==========================================
  Files         192      193       +1
  Lines       19263    19583     +320
==========================================
+ Hits        14280    14312      +32
- Misses       4983     5271     +288
==========================================
```
So, do we only support image quantization just for Nemotron VL? If yes, why?

@Edwardf0t1 do you have experiments evaluating the accuracy impact of using the new dataset?

At this time, only Nemotron VL has been tested. We can extend the logic to support other VLMs later. Note that different VLMs may have different forward functions; e.g., the way the vision encoder interacts with the language decoder can vary across models. Do you have a preferred VL model you'd like us to support next? For instance, Qwen3-VL?
Tested on two benchmarks, DocVQA and InfoVQA, for Nemotron Nano VL v2 with the vLLM backend:
Image-text calibration is only marginally better in these cases, but the calibration flow in this PR should be ready. The follow-up experiments can be
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
📝 Walkthrough

This pull request introduces Vision-Language Model (VLM) calibration support for post-training quantization. It adds new dataset utilities for streaming Nemotron VLM data, implements image-text pair calibration loops, extends the quantization pipeline to handle multimodal models, and includes documentation and helper functions for Nemotron VL model processing.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant hf_ptq
    participant ModelLoader
    participant VLMProcessor
    participant DataLoader
    participant CalibLoop
    participant Quantizer
    User->>hf_ptq: Execute with --calib_with_images
    hf_ptq->>ModelLoader: load_model() with calib_with_images=True
    ModelLoader->>ModelLoader: Detect Nemotron VL model
    ModelLoader->>VLMProcessor: Create AutoProcessor
    VLMProcessor->>VLMProcessor: Configure padding tokens & side
    ModelLoader->>ModelLoader: extract_and_prepare_language_model_from_vl()
    ModelLoader->>hf_ptq: Return LM + default_pad_token
    hf_ptq->>DataLoader: Load nemotron_vlm_dataset_v2
    DataLoader->>DataLoader: Stream tar shards + JSONL
    DataLoader->>DataLoader: Match images to messages
    DataLoader->>hf_ptq: Yield {id, messages, image}
    hf_ptq->>CalibLoop: create_vlm_calibration_loop(model, dataloader)
    CalibLoop->>CalibLoop: Inspect model.forward signature
    loop Per batch
        CalibLoop->>CalibLoop: Extract pixel_values, input_ids, attention_mask
        CalibLoop->>CalibLoop: safe_nemotron_vl_forward()
        CalibLoop->>CalibLoop: Align vision embeddings with img_context_token_id
        CalibLoop->>CalibLoop: Run LM forward (no grad, eval mode)
    end
```
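The per-batch flow in the diagram (inspect the forward signature, then feed each calibration batch through the model) can be sketched as follows. This is a simplified stand-in, not the PR's actual implementation: the `DummyVLM` class and batch contents are invented for illustration, and the real loop would run under `torch.no_grad()` with the model in eval mode.

```python
import inspect

def create_vlm_calibration_loop(model, dataloader):
    # Inspect the model's forward signature once, so each batch can be
    # filtered down to the keyword arguments this model actually accepts
    # (different VLMs expose different forward signatures).
    accepted = set(inspect.signature(model.forward).parameters)

    def calibrate():
        for batch in dataloader:
            # The real loop would wrap this in torch.no_grad() / eval mode;
            # here we simply forward the supported keys of each batch.
            model.forward(**{k: v for k, v in batch.items() if k in accepted})

    return calibrate

class DummyVLM:
    """Stand-in model recording which kwargs reach forward()."""
    def __init__(self):
        self.seen = []

    def forward(self, input_ids=None, pixel_values=None):
        self.seen.append({"input_ids": input_ids, "pixel_values": pixel_values})

model = DummyVLM()
batches = [{"input_ids": [1, 2], "pixel_values": [[0.5]], "labels": [3]}]
create_vlm_calibration_loop(model, batches)()
print(model.seen)  # 'labels' is dropped: forward() does not accept it
```

Filtering by signature is what lets one calibration loop serve models whose vision inputs are named differently, at the cost of silently dropping keys a model does not declare.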
```python
return any("nemotron" in arch.lower() for arch in architectures)
```

```python
def create_vlm_calibration_loop(full_model, calib_dataloader):
```
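The architecture check quoted in this review thread can be wrapped in a small helper. This is a hypothetical sketch (the function name and config shape are illustrative, not the PR's actual API): a model counts as Nemotron VL if any architecture name in its Hugging Face config mentions "nemotron".

```python
from types import SimpleNamespace

def is_nemotron_vl(config) -> bool:
    # Hypothetical wrapper around the quoted check: case-insensitively
    # look for "nemotron" among the config's declared architectures.
    architectures = getattr(config, "architectures", None) or []
    return any("nemotron" in arch.lower() for arch in architectures)

print(is_nemotron_vl(SimpleNamespace(architectures=["NemotronH_Nano_VL_V2"])))
print(is_nemotron_vl(SimpleNamespace(architectures=["LlamaForCausalLM"])))
```

A substring match like this is deliberately loose; it keeps the detection working across Nemotron VL variants without hard-coding each architecture string.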
Do we want to move it to vlm_dataset_utils.py? In dataset_utils.py we create the LLM calibration loop.
I prefer to keep it here as it’s a PTQ calibration loop tied to the example workflow and nemotron_vl_calib, not a dataset utility. Moving it into vlm_dataset_utils.py would mix concerns (data loading vs. calibration execution) and could introduce awkward dependencies (a core util importing example-only logic).
We may create a new modelopt/torch/utils/vlm_calib_utils.py to host it but I don’t think it’s needed right now.
```python
if args.calib_with_images and is_nemotron_vl_model:
    calibrate_loop = create_vlm_calibration_loop(full_model, calib_dataloader)
else:
    calibrate_loop = create_forward_loop(dataloader=calib_dataloader)
```
E.g., this is imported from dataset_utils.py.
## What does this PR do?

**Type of change:** New feature

**Overview:** The primary goal of this PR is to allow the model optimizer to use image-text pair data during the calibration phase of quantization, which is likely to help improve the accuracy of quantized VLMs like Nemotron VL, particularly on visual understanding tasks, compared to text-only calibration data.

- New Feature: Adds support for VLM calibration specifically using image-text data.
- Dataset Integration: Introduces support for sampling from the `Nemotron-VLM-Dataset-v2`.
- Refactoring: Created a separate utility for VLM datasets to keep the main Hugging Face PTQ script (`hf_ptq.py`) clean.
- Simplified logic for handling multimodal inputs.
- Addressed specific issues encountered when calibrating the `Nemotron-Nano-VL-12B-V2` model with image data.
- Documentation: Updated the README to include instructions and examples for VLM calibration.

This PR complements #347 and we will consolidate llm_ptq and vlm_ptq examples in follow-up PRs.

## Usage

```shell
python3 hf_ptq.py --pyt_ckpt_path /home/scratch.omniml_data_2/models/Nemotron-Nano-VL-12B-V2 --qformat nvfp4 --export_path /home/omniml_data_3/zhiyuc/checkpoints/Nemotron-Nano-VL-12B-V2-NVFP4-doccalib --trust_remote_code --kv_cache_qformat none --calib_with_images --vlm_dataset nemotron_vlm_dataset_v2 --vlm_subsets sparsetables,plotqa_cot --calib_size 512
```

## Testing

## Before your PR is "*Ready for review*"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Not yet

## Summary by CodeRabbit

* **New Features**
  * Added Vision-Language Model (VLM) calibration support with image-text pair data, specifically for Nemotron VL models.
  * Added new `--calib_with_images` CLI flag to enable image-based calibration workflows.
  * Integrated Nemotron VLM dataset v2 for streaming multimodal calibration data.
* **Documentation**
  * Added VLM calibration guidance in the PTQ README with usage examples and dataset information.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
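The dataset integration streams tar shards and matches images to their JSONL messages. The pairing step can be sketched like this; the function name, record fields, and shard contents are all illustrative assumptions, not the PR's actual utility.

```python
import io
import json
import tarfile

def iter_image_text_pairs(tar_stream, jsonl_lines):
    # Index JSONL records by image filename, then stream the tar shard
    # once, yielding {id, messages, image} dicts as members match.
    records = {r["image"]: r for r in map(json.loads, jsonl_lines)}
    with tarfile.open(fileobj=tar_stream, mode="r|*") as tf:
        for member in tf:
            rec = records.get(member.name)
            if rec is not None:
                data = tf.extractfile(member).read()
                yield {"id": rec["id"], "messages": rec["messages"], "image": data}

# Build a tiny in-memory shard to exercise the matcher.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    payload = b"\x89PNG-fake-bytes"
    info = tarfile.TarInfo(name="chart_0001.png")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))
buf.seek(0)

jsonl = [
    '{"id": "s1", "messages": [{"role": "user", "content": "Read the chart."}], '
    '"image": "chart_0001.png"}'
]
pairs = list(iter_image_text_pairs(buf, jsonl))
print(pairs[0]["id"], len(pairs[0]["image"]))
```

Streaming the shard (`"r|*"`) rather than extracting it keeps memory bounded for large calibration sets, at the cost of requiring a single sequential pass.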
Can we also include Qwen3.5-VL?